{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 有限内存的 Kennard-Stone 采样方法实现"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 创建时间：2021-01-31"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们继续上一篇文档，对更大数据量，导致无法储存 $(n_\\mathrm{sample}, n_\\mathrm{sample})$ 距离矩阵时，可以采用的 Kennard-Stone 采样策略。\n",
    "\n",
    "该方法的计算复杂度是 $O(n_\\mathrm{sample}^2 n_\\mathrm{feature} + n_\\mathrm{sample} n_\\mathrm{result} n_\\mathrm{feature})$，其中 $n_\\mathrm{result}$ 为采样数，$n_\\mathrm{feature}$ 为特征向量的长度。如果将 $n_\\mathrm{feature}$ 当做常数值，那么该方法大约是代价比较大的 $O(n_\\mathrm{sample}^2)$ 方法。\n",
    "\n",
    "Kennard-Stone Algorithm 的原始文献是 Kennard, Stone [^Kennard-Stone.T.1969]。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这篇文档的特色是：\n",
    "\n",
    "- 可以在任意内存大小限制下，找到两个欧氏距离 (Euclidean Distance) 最远的样本点；\n",
    "\n",
    "- 上述算法可以通过 multiprocessing 实现，多核机器的效率会比先前的方法高；\n",
    "\n",
    "- 通过 on-the-fly 计算距离的方式，在 Kennard-Stone 采样过程中不需要使用完整的距离矩阵；\n",
    "\n",
    "- 对于 Kennard-Stone 的 C 语言实现，一样可以通过 OpenMP 实现并行化。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "需要注意，内存有限的意思**并非任意小的内存都可以实现算法**，而是在至少能储存原始数据集 $n_\\mathrm{sample} n_\\mathrm{feature}$ 大小之外，还有适量空余内存，其内存大小也是 $O(n_\\mathrm{sample})$ 量级。具体的讨论会放在后文中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:19.101292Z",
     "iopub.status.busy": "2021-01-31T12:08:19.100878Z",
     "iopub.status.idle": "2021-01-31T12:08:20.295541Z",
     "shell.execute_reply": "2021-01-31T12:08:20.293639Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from multiprocessing import Pool\n",
    "\n",
    "np.set_printoptions(precision=6, linewidth=120, suppress=True)\n",
    "np.random.seed(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python 实现"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 随机数据集"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- `n_sample` $n_\\mathrm{sample}$ 为样本数。现在定为 20000。\n",
    "\n",
    "- `n_feature` $n_\\mathrm{feature}$ 为特征数 (描述样本性质的数据)。现在定为 100。\n",
    "\n",
    "- `X` $\\mathbf{X}$ 或 $x_{ia}$ 为完整数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:20.299809Z",
     "iopub.status.busy": "2021-01-31T12:08:20.299417Z",
     "iopub.status.idle": "2021-01-31T12:08:20.348202Z",
     "shell.execute_reply": "2021-01-31T12:08:20.347769Z"
    }
   },
   "outputs": [],
   "source": [
    "n_sample  = 20000\n",
    "n_feature = 100\n",
    "X = np.random.randn(n_sample, n_feature)\n",
    "X *= 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 找到距离最远的两个样本"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们仍然需要生成距离矩阵，来找到距离最远的两个样本。但这个过程不一定需要储存完整的距离矩阵 $d_{ij}$，而是分块计算，最后统合。\n",
    "\n",
    "我们考虑多进程分块。如果进程数量是 $n_\\mathrm{proc}$，在每个进程中计算的距离矩阵分块是 $(n_\\mathrm{batch}, n_\\mathrm{batch})$，那么为了计算距离分块需要 $n_\\mathrm{proc} n_\\mathrm{batch}^2$。同时，每个分块的最大值需要进行存储，因此需要 $n_\\mathrm{sample}^2 / n_\\mathrm{batch}^2$ 的内存 (这块内存尽管可以节省，但为了程序容易编写，就保留了这部分内存)。因此，总内存需要\n",
    "\n",
    "$$\n",
    "n_\\mathrm{proc} n_\\mathrm{batch}^2 + n_\\mathrm{sample}^2 / n_\\mathrm{batch}^2\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "根据三角不等式，最少的内存需求在 $n_\\mathrm{batch} = n_\\mathrm{proc}^{-1/4} \\sqrt{n_\\mathrm{sample}}$ 时成立，即 $2 \\sqrt{n_\\mathrm{proc}} n_\\mathrm{sample}$。\n",
    "\n",
    "但如果单从效率上考虑，每个分块越大，那么多进程的通讯次数越少，消耗时间也就越少。这里我们选用 $n_\\mathrm{batch}$ 为 1000。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:20.353725Z",
     "iopub.status.busy": "2021-01-31T12:08:20.353315Z",
     "iopub.status.idle": "2021-01-31T12:08:20.361466Z",
     "shell.execute_reply": "2021-01-31T12:08:20.361003Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "20"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n_batch = 1000\n",
    "t = np.einsum(\"ia, ia -> i\", X, X)\n",
    "\n",
    "def get_dist_slice(X, t, sliceA, sliceB):\n",
    "    distAB = t[sliceA, None] - 2 * X[sliceA] @ X[sliceB].T + t[None, sliceB]\n",
    "    if sliceA == sliceB:\n",
    "        np.fill_diagonal(distAB, 0)\n",
    "    return np.sqrt(distAB)\n",
    "\n",
    "def get_slices(n_sample, n_batch):\n",
    "    p = list(np.arange(0, n_sample, n_batch)) + [n_sample]\n",
    "    return [slice(p[i], p[i+1]) for i in range(len(p) - 1)]\n",
    "\n",
    "def get_maxloc_slice(slice_pair):\n",
    "    dist_slice = get_dist_slice(X, t, slice_pair[0], slice_pair[1])\n",
    "    max_indexes = np.unravel_index(np.argmax(dist_slice), dist_slice.shape)\n",
    "    return (dist_slice[max_indexes], max_indexes[0] + slice_pair[0].start, max_indexes[1] + slice_pair[1].start)\n",
    "\n",
    "slices = get_slices(n_sample, n_batch)\n",
    "n_slices = len(slices)\n",
    "slice_pairs = [(slices[i], slices[j]) for i in range(n_slices) for j in range(n_slices) if i <= j]\n",
    "n_slices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最终得到的两个样本就如下所述："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:20.364938Z",
     "iopub.status.busy": "2021-01-31T12:08:20.364547Z",
     "iopub.status.idle": "2021-01-31T12:08:21.747106Z",
     "shell.execute_reply": "2021-01-31T12:08:21.746687Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 21.6 ms, sys: 8.2 ms, total: 29.8 ms\n",
      "Wall time: 1.38 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(7794, 18772)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "with Pool(4) as p:\n",
    "    maxloc_slice_list = p.map(get_maxloc_slice, slice_pairs)\n",
    "max_indexes = maxloc_slice_list[np.argmax([v[0] for v in maxloc_slice_list])][1:]\n",
    "max_indexes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Kennard-Stone 采样：Python 程序"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "事实上，此处的 Kennard-Stone 采样过程与上一篇文档完全一致，只是在所有需要索引距离矩阵处都进行了现场计算的工作而已。由于现场计算距离额外地引入了 $n_\\mathrm{feature}$ 的维度，因此此步的计算复杂度是 $O(n_\\mathrm{sample} n_\\mathrm{result} n_\\mathrm{feature})$；但内存复杂度没有变化，仍然是 $O(n_\\mathrm{sample})$。\n",
    "\n",
    "尽管算法没有发生太大变化，但距离矩阵现算导致耗时会大量增加，并且要考虑到额外增加的变量与函数通讯过程。因此它可以看做是代价较大的复杂度 $O(n_\\mathrm{sample} n_\\mathrm{result})$。如果内存充足，实际上不是那么建议使用此方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:21.753989Z",
     "iopub.status.busy": "2021-01-31T12:08:21.753569Z",
     "iopub.status.idle": "2021-01-31T12:08:21.755645Z",
     "shell.execute_reply": "2021-01-31T12:08:21.755287Z"
    }
   },
   "outputs": [],
   "source": [
    "def ks_sampling_core_mem(X, seed, n_result):\n",
    "    # Definition: Output Variables\n",
    "    result = np.zeros(n_result, dtype=int)\n",
    "    v_dist = np.zeros(n_result, dtype=float)\n",
    "    \n",
    "    # Definition: Intermediate Variables\n",
    "    n_seed = len(seed)\n",
    "    n_sample = X.shape[0]\n",
    "    min_vals = remains = None\n",
    "    \n",
    "    # --- Initialization ---\n",
    "    def sliced_dist(idx):\n",
    "        tmp_X = X[remains] - X[idx]\n",
    "        return np.sqrt(np.einsum(\"ia, ia -> i\", tmp_X, tmp_X))\n",
    "\n",
    "    selected = [False] * n_sample\n",
    "    remains = []\n",
    "    for i in range(n_sample):\n",
    "        if i not in seed:\n",
    "            remains.append(i)\n",
    "    result[:n_seed] = seed\n",
    "    if n_seed == 2:\n",
    "        v_dist[0] = np.linalg.norm(X[seed[0]] - X[seed[1]])\n",
    "    min_vals = sliced_dist(seed[0])\n",
    "    \n",
    "    for n in seed:\n",
    "        np.min(np.array([min_vals, sliced_dist(n)]), axis=0, out=min_vals)\n",
    "    # --- Loop argmax minimum ---\n",
    "    for n in range(n_seed, n_result):\n",
    "        sup_index = min_vals.argmax()\n",
    "        result[n] = remains[sup_index]\n",
    "        v_dist[n - 1] = min_vals[sup_index]\n",
    "        remains.pop(sup_index)\n",
    "        min_vals[sup_index:-1] = min_vals[sup_index+1:]\n",
    "        min_vals = min_vals[:-1]\n",
    "        np.min(np.array([min_vals, sliced_dist(result[n])]), axis=0, out=min_vals)\n",
    "    return result, v_dist"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面给出抽样 2000 个样本时的纯 Python 程序执行过程。可以看出耗时非常明显。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:21.767487Z",
     "iopub.status.busy": "2021-01-31T12:08:21.767081Z",
     "iopub.status.idle": "2021-01-31T12:08:33.963376Z",
     "shell.execute_reply": "2021-01-31T12:08:33.962954Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 11.9 s, sys: 300 ms, total: 12.2 s\n",
      "Wall time: 12.2 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(array([ 7794, 18772, 11049, ..., 14861, 17154,   733]),\n",
       " array([2004.309858, 1794.599652, 1756.059579, ..., 1283.925941, 1283.855823,    0.      ]))"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "ks_sampling_core_mem(X, max_indexes, 2000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Kennard-Stone 采样：C 程序"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于 C 语言的程序，其调用与上一篇文档类似，但作为 `seed` 的关键词不再是可选参数了。对于默认采样方式而言，用户需要自行提供最远处的两个样本序号。\n",
    "\n",
    "对所有 20000 个样本采样，可以在 10 秒以内完成。其速度显然不如全部距离矩阵元素都能立即获得的算法，但仍然是可以接受的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:33.966165Z",
     "iopub.status.busy": "2021-01-31T12:08:33.965778Z",
     "iopub.status.idle": "2021-01-31T12:08:34.286885Z",
     "shell.execute_reply": "2021-01-31T12:08:34.287878Z"
    }
   },
   "outputs": [],
   "source": [
    "from KS_Sampling import ks_sampling_mem_core_cpp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:34.295005Z",
     "iopub.status.busy": "2021-01-31T12:08:34.294529Z",
     "iopub.status.idle": "2021-01-31T12:08:35.174666Z",
     "shell.execute_reply": "2021-01-31T12:08:35.175239Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 3.5 s, sys: 0 ns, total: 3.5 s\n",
      "Wall time: 878 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(array([ 7794, 18772, 11049, ..., 14861, 17154,   733]),\n",
       " array([2004.309937, 1794.599731, 1756.059448, ..., 1283.925903, 1283.855835,    0.      ]))"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "ks_sampling_mem_core_cpp(X, max_indexes, 2000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:35.177074Z",
     "iopub.status.busy": "2021-01-31T12:08:35.176666Z",
     "iopub.status.idle": "2021-01-31T12:08:40.324515Z",
     "shell.execute_reply": "2021-01-31T12:08:40.325055Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 20.6 s, sys: 0 ns, total: 20.6 s\n",
      "Wall time: 5.14 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(array([ 7794, 18772, 11049, ...,  2941,  7521, 17265]),\n",
       " array([2004.309937, 1794.599731, 1756.059448, ...,  881.746399,  859.787659,    0.      ]))"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "ks_sampling_mem_core_cpp(X, max_indexes, 20000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 实例演示：QM9 数据集 CM 特征的 Kennard-Stone 采样"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "作为一个可以现实会遇到的问题，我们对化学分子中使用的 QM9 数据集 (131k 个分子) 的 CM (Coulumb Matrix) 特征进行 QM9 采样。其数据来源是下述文章的补充信息：\n",
    "\n",
    "- Faber, F. A., et al; \\*Lilienfeld, O. A. v. Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error, J. Comput. Theory Chem. **2017**, *13* (11), 5255-5264. doi: 10.1021/acs.jctc.7b00577\n",
    "\n",
    "这里使用 Python 分块找到最远两点、C 语言执行 Kennard-Stone 算法的方式，进行样本的选取。\n",
    "\n",
    "使用的程序与上文基本是相同的，但这些功能都已经整合到 {download}`KS_Sampling.py` 文件中的函数 `ks_sampling_mem` 了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与上一篇文档相似地，我们仍然用 4 核 CPU 计算。\n",
    "\n",
    "这里所报出的警告表明存在一些分子间距离过小，导致程序产生数值误差，对非常小的负值求开方。大多数情况下，这不会导致很严重的错误。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:40.327684Z",
     "iopub.status.busy": "2021-01-31T12:08:40.326903Z",
     "iopub.status.idle": "2021-01-31T12:08:40.330533Z",
     "shell.execute_reply": "2021-01-31T12:08:40.331072Z"
    }
   },
   "outputs": [],
   "source": [
    "from KS_Sampling import ks_sampling_mem\n",
    "import numpy as np\n",
    "\n",
    "np.set_printoptions(precision=6, linewidth=120, suppress=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:40.333564Z",
     "iopub.status.busy": "2021-01-31T12:08:40.332794Z",
     "iopub.status.idle": "2021-01-31T12:08:40.336030Z",
     "shell.execute_reply": "2021-01-31T12:08:40.336561Z"
    }
   },
   "outputs": [],
   "source": [
    "n_sample = 130829\n",
    "n_feature = 900"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:40.338975Z",
     "iopub.status.busy": "2021-01-31T12:08:40.338196Z",
     "iopub.status.idle": "2021-01-31T12:08:59.098706Z",
     "shell.execute_reply": "2021-01-31T12:08:59.098214Z"
    }
   },
   "outputs": [],
   "source": [
    "X = np.empty((n_sample, n_feature), dtype=np.float32)\n",
    "with open(\"CM\", \"r\") as f:\n",
    "    for i in range(n_sample):\n",
    "        X[i] = np.array(f.readline().split()[1:], dtype=np.float32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:08:59.101519Z",
     "iopub.status.busy": "2021-01-31T12:08:59.101119Z",
     "iopub.status.idle": "2021-01-31T12:52:44.761598Z",
     "shell.execute_reply": "2021-01-31T12:52:44.762007Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/a/Documents/2020-01-30-KS_Memory/KS_Sampling.py:135: RuntimeWarning: invalid value encountered in sqrt\n",
      "  return np.sqrt(distAB)\n",
      "/home/a/Documents/2020-01-30-KS_Memory/KS_Sampling.py:135: RuntimeWarning: invalid value encountered in sqrt\n",
      "  return np.sqrt(distAB)\n",
      "/home/a/Documents/2020-01-30-KS_Memory/KS_Sampling.py:135: RuntimeWarning: invalid value encountered in sqrt\n",
      "  return np.sqrt(distAB)\n",
      "/home/a/Documents/2020-01-30-KS_Memory/KS_Sampling.py:135: RuntimeWarning: invalid value encountered in sqrt\n",
      "  return np.sqrt(distAB)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2h 49min 3s, sys: 8.04 s, total: 2h 49min 11s\n",
      "Wall time: 43min 45s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "QM9_CM_KS_result = ks_sampling_mem(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-01-31T12:52:44.764200Z",
     "iopub.status.busy": "2021-01-31T12:52:44.763787Z",
     "iopub.status.idle": "2021-01-31T12:52:44.768356Z",
     "shell.execute_reply": "2021-01-31T12:52:44.768756Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([  2858,   3284,  99137, ...,  67127,  64051, 103213]),\n",
       " array([  0.010785, 207.073349, 200.928101, ...,   0.000006,   0.000005,   0.      ]))"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "QM9_CM_KS_result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[^Kennard-Stone.T.1969]: Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. *Technometrics* **1969**, *11* (1), 137–148. doi: [10.1080/00401706.1969.10490666](https://doi.org/10.1080/00401706.1969.10490666)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
